Apple NPU acceleration integrated into llama.cpp, using MiniCPM-V 4.0 as an example. #15262
Conversation
Feat ios
feat ios: add clean kv cache
Generally looks OK. Need to improve encapsulation of the CoreML code (see comments). Would need a review from @ngxson.
Also:
- Use "CoreML" instead of "ANE"
- Would eventually need instructions for generating the CoreML inference code - can add those after the PR is approved
tools/mtmd/clip.h (outdated):

```cpp
bool ane_embedding(struct clip_ctx * ctx, int n_threads, const struct clip_image_f32_batch * imgs, float * vec);
bool ane_resampler(struct clip_ctx * ctx, int n_threads, const struct clip_image_f32_batch * imgs, const float * vit_embedding, float * vec);
```
No need to expose this in the public interface
```cpp
// ANE support functions
void clip_set_ane_model_path(struct clip_ctx * ctx, const char * ane_model_path);
```
We should find a way to avoid this. Maybe we can do something similar to whisper.cpp.
```diff
@@ -82,6 +82,7 @@ struct mtmd_context_params {
     enum ggml_log_level verbosity;
     const char * image_marker; // deprecated, use media_marker instead
     const char * media_marker;
+    const char * ane_model_path; // path to ANE model for iOS
```
Instead of the term "ane", use the term "coreml" as it is more correct. CoreML models can run not only on the Apple Neural Engine, but also on the GPU and CPU.
```cpp
static int flag = 0;
static const void* coremlEncoder = NULL;
static std::string cached_model_path = "";

// Check if we need to load a new model
if (flag == 0 || (ane_model_path && cached_model_path != ane_model_path)) {
    if (coremlEncoder) {
```
Avoid this global state. Figure out a way to move this to the clip context.
The global idea is good. However, I think we should take time to make sure this can be useful in the long term.
The biggest issue atm is that many `TODO`s are being copied in the PR, which will make refactoring very difficult in the future. We must resolve this problem first.
Related to UX, if we cannot have the embeddings and resampler all in one CoreML model, I think we should separate it into 2 repos on Hugging Face or ModelScope: one having only the ggml implementation and one having CoreML. Having everything in the same place seems very confusing for most users, and most of them don't even have time to look at this PR.
```cpp
static bool ane_embedding(clip_ctx * ctx, const int n_threads, const clip_image_f32_batch * imgs_c_ptr, float * vec) {
    const clip_image_f32_batch & imgs = *imgs_c_ptr;
    int batch_size = imgs.entries.size();
```
I don't feel quite comfortable duplicating this function, as you're also duplicating many `TODO`s, which will make cleaning this up extremely difficult in the future.
We should find a way to merge it with an existing function.
I understand your idea; I will try to modify it.
```cpp
float * vit_embedding1 = (float *)malloc(1100*1152*sizeof(float));
float * vit_embedding2 = (float *)malloc(1100*1152*sizeof(float));
```
We should avoid `malloc` because we had a lot of memory leaks in the old code base of `clip.cpp`. Use `std::vector<float>` instead.
```cpp
ane_embedding(ctx, n_threads, &imgs, vit_embedding1);
clip_image_encode_ane(vit_embedding1, vit_embedding2, ctx->ane_model_path.c_str());
ane_resampler(ctx, n_threads, &imgs, vit_embedding2, vec);
```
Seems like only the ViT part is done by ANE; the rest (embeddings, resampler) is still done by ggml. Any reason why we can't do the rest with ANE? I think it could be a cleaner approach, as we would then be able to load only the `.mlmodelc` file and no longer need the `mmproj.gguf` file.
Also, maybe we should try `ggml_custom_4d` and inject `clip_image_encode_ane` as a node on the ggml cgraph. If that works, it will make everything look much cleaner. Do you think it's a valid use case of `ggml_custom_4d`, @ggerganov?
@ngxson Yes, only the ViT is currently being replaced with ANE.
Because the embed calculations aren't yet correctly computed with ANE, I've bypassed the two embed calculations and only replaced the ViT itself.
I'm also still trying other methods to see if there's a solution.
> Also, maybe we should try ggml_custom_4d and inject the clip_image_encode_ane as a node on ggml cgraph. If that works, it will make everything look much cleaner. Do you think it's a valid use case of ggml_custom_4d

Haven't considered such a use case for `ggml_custom_4d`. Sounds worth exploring.
@ggerganov @ngxson Yes, I understand that introducing a new feature requires more time to discuss its design, including its name, structure, and interface definition. All of this takes time, and I have plenty of time to prepare for it. I will follow the discussion and ensure that this feature is incorporated into llama.cpp in a proper manner.
As stated in #14983, I have integrated Apple NPU (ANE) acceleration into llama.cpp.
Using MiniCPM-V 4.0 as an example, I will introduce a simple way to use ANE, and I hope we can discuss a better approach.
Download the ANE model from Hugging Face or ModelScope; if you downloaded the zip file, please unzip it.
Used like mmproj, I added an "--ane" option whose argument is the path to the downloaded ane_minicpmv4_vit_f16.mlmodelc file.
I tested ANE acceleration on several devices. The benchmark results are as follows:
A point worth noting: the first time ANE is used there is a loading time, so it will be slightly slower. After that, as long as the ANE model is not updated, it remains ready and waiting in the system.
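A hypothetical invocation following the description above; the binary name, model filenames, and the other flags are illustrative placeholders, and only the `--ane` option and the `.mlmodelc` path come from this PR:

```shell
# Run MiniCPM-V 4.0 with the CoreML-accelerated ViT (all paths are placeholders)
./llama-mtmd-cli \
  -m models/MiniCPM-V-4-Q4_K_M.gguf \
  --mmproj models/mmproj-minicpmv4.gguf \
  --ane models/ane_minicpmv4_vit_f16.mlmodelc \
  --image demo.jpg \
  -p "Describe this image."
```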